(Table 2: Summary Statistics of Heart Disease dataset)
Key Insights & Observations
Data Quality:
No missing or duplicate records
Some unrealistic values (e.g., cholesterol = 0) were cleaned prior to modeling
Gender Distribution:
79% male, 21% female
Potential influence on model fairness and generalization
Chest Pain Type:
54% of patients reported asymptomatic chest pain (ASY)
Indicates a high number of silent or undiagnosed cases
Heart Disease Distribution:
55.3% diagnosed with heart disease
44.7% not diagnosed with heart disease
Fairly balanced for effective classification
Distribution of Features
Figure 1: Distribution of some Features
Code
library(ggplot2)
library(gridExtra)

# The distribution of 'Age' with a histogram - roughly normal distribution
ageplot <- ggplot(data, aes(x = Age)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Age Distribution", x = "Age", y = "Count")

# The distribution of 'Heart Disease' with a bar chart - no class imbalance
hdplot <- ggplot(data, aes(x = HeartDisease)) +
  geom_bar(fill = "blue", color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = c(0, 1)) +
  labs(title = "Heart Disease Class Distribution", x = "Heart Disease", y = "Count")

# The distribution of 'Sex' with a bar chart - imbalance: ~4x more males than females
splot <- ggplot(data, aes(x = Sex)) +
  geom_bar(fill = "blue", color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Sex Distribution", x = "Sex", y = "Count")

# The distribution of 'Cholesterol' with a histogram - over 150 records with a
# cholesterol of 0; otherwise roughly normal
cplot <- ggplot(data, aes(x = Cholesterol)) +
  geom_histogram(fill = "blue", color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Cholesterol Distribution", x = "Cholesterol", y = "Count")

grid.arrange(hdplot, splot, ageplot, cplot, ncol = 2, nrow = 2)
These distributions highlight trends that are relevant in predicting heart disease.
Distribution Explanations
Heart Disease Distribution (Top-Left): The distribution is appropriately balanced, minimizing the chances of bias in the model.
Sex Distribution (Top-Right): The dataset has more male patients than female, which might impact predictions.
Age Distribution (Bottom-Left): Most patients fall within the 40-70 years age range, which may reduce model accuracy for younger individuals.
Cholesterol Distribution (Bottom-Right): The presence of zero values in cholesterol is unrealistic, indicating the need for data cleaning.
# Correlation matrix for numeric features
library(dplyr)
library(corrplot)

numeric_data <- data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_data)
corrplot(cor_matrix, method = "circle", type = "upper",
         tl.col = "black", tl.cex = 0.7, addCoef.col = "black")
Correlation Matrix Insights
Figure 2 shows correlations of features in the data set with heart disease.
Oldpeak (0.40) stands out, since higher ST depression is associated with greater heart disease risk. Oldpeak refers to ST depression, a measurement on an ECG that indicates reduced blood flow to the heart.
Patients who have heart disease are more likely to have a lower maximum heart rate (MaxHR, -0.40).
Age (0.28) and Fasting Blood Sugar (0.27) also emerged as positive correlates, confirming that older people and people with high fasting blood sugar levels are at higher risk.
Modeling and Results
We begin by performing the necessary cleaning and preprocessing of the data.
We then use the decision tree algorithm to demonstrate how a single tree works and report its performance.
Finally, we apply the random forest algorithm, an ensemble of decision trees, to see how it performs in comparison, ideally providing a more accurate prediction.
Data Preprocessing and Cleaning
No null or NA values were found in the data set
One row with a RestingBP = 0
172 rows with Cholesterol = 0
We decided to drop these rows from the data set, as they were missing valid data.
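This filtering step can be sketched in base R as follows; the toy data frame below stands in for the full Heart Disease dataset, and the column names match the report:

```r
# Toy stand-in for the loaded dataset; in the report, `data` holds the full
# Heart Disease dataset
data <- data.frame(
  RestingBP   = c(140, 0, 130),
  Cholesterol = c(289, 210, 0)
)

# Keep only rows where both measurements are physiologically plausible (> 0)
data_clean <- subset(data, RestingBP > 0 & Cholesterol > 0)
nrow(data_clean)  # 1: the RestingBP = 0 row and the Cholesterol = 0 row are dropped
```
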
Data Encoding
The next step is to encode the data.
The random forest algorithm, like most machine learning algorithms, functions best with numerical values.
We use one-hot encoding to transform each categorical variable into binary columns that indicate the presence (1) or absence (0) of each category.
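One-hot encoding can be sketched in base R with model.matrix() (caret::dummyVars is an equivalent alternative; the report does not name its encoding function). The two-column toy frame stands in for the full set of categorical features:

```r
# Toy categorical features; the report applies the same idea to Sex,
# ChestPainType, RestingECG, ExerciseAngina, and ST_Slope
df <- data.frame(
  Sex           = factor(c("M", "F", "M")),
  ChestPainType = factor(c("ATA", "NAP", "ASY"))
)

# Full 0/1 indicator columns for every level of every factor:
# `~ . - 1` drops the intercept, and contrasts.arg keeps all levels
# instead of dropping a reference level
encoded <- model.matrix(
  ~ . - 1, data = df,
  contrasts.arg = lapply(df, contrasts, contrasts = FALSE)
)
colnames(encoded)
# "SexF" "SexM" "ChestPainTypeASY" "ChestPainTypeATA" "ChestPainTypeNAP"
```
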
Encoded Data Preview
| Age | SexF | SexM | ChestPainTypeASY | ChestPainTypeATA | ChestPainTypeNAP | ChestPainTypeTA | RestingBP | Cholesterol | FastingBS | RestingECGLVH | RestingECGNormal | RestingECGST | MaxHR | ExerciseAnginaN | ExerciseAnginaY | Oldpeak | ST_SlopeDown | ST_SlopeFlat | ST_SlopeUp | HeartDisease |
|-----|------|------|------------------|------------------|------------------|-----------------|-----------|-------------|-----------|---------------|------------------|--------------|-------|-----------------|-----------------|---------|--------------|--------------|------------|--------------|
| 40  | 0    | 1    | 0                | 1                | 0                | 0               | 140       | 289         | 0         | 0             | 1                | 0            | 172   | 1               | 0               | 0       | 0            | 0            | 1          | 0            |
| 49  | 1    | 0    | 0                | 0                | 1                | 0               | 160       | 180         | 0         | 0             | 1                | 0            | 156   | 1               | 0               | 1       | 0            | 1            | 0          | 1            |
| 37  | 0    | 1    | 0                | 1                | 0                | 0               | 130       | 283         | 0         | 0             | 0                | 1            | 98    | 1               | 0               | 0       | 0            | 0            | 1          | 0            |
Splitting Data
We split the data set into training and test subsets.
The training subset will contain 70% of the data.
The test subset will contain 30% of the data.
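The 70/30 split can be sketched in base R as follows (the seed value is an assumption, included for reproducibility; the toy data frame stands in for the cleaned, encoded dataset):

```r
# Toy stand-in: 100 rows in place of the cleaned, encoded dataset
data_clean <- data.frame(id = 1:100)

set.seed(42)  # seed is an assumption, for reproducibility
train_idx <- sample(nrow(data_clean), size = floor(0.7 * nrow(data_clean)))
train <- data_clean[train_idx, , drop = FALSE]
test  <- data_clean[-train_idx, , drop = FALSE]

c(train = nrow(train), test = nrow(test))  # 70 / 30 split
```
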
Model Fitting and Prediction
Decision Tree
We first demonstrate how a single decision tree would look for our data set.
We achieved an accuracy of 81.7% without hyperparameter tuning.
The decision tree can be followed to determine what the predicted end result would be.
For example, if a patient has ST_SlopeUp = 1 and ChestPainTypeASY = 0, they likely do not have heart disease. If a patient has ST_SlopeUp = 0, MaxHR < 151, and SexF = 0, then the patient likely does have heart disease.
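A single classification tree can be fitted with the rpart package (an assumption; the report does not name the package it used). The synthetic data below is a toy stand-in for the training split, built around the same rule described above:

```r
library(rpart)

# Synthetic stand-in for the training split, encoding the rule:
# ST_SlopeUp = 0 and MaxHR < 151 => heart disease
set.seed(1)
n <- 300
train <- data.frame(
  ST_SlopeUp       = rbinom(n, 1, 0.5),
  ChestPainTypeASY = rbinom(n, 1, 0.5),
  MaxHR            = round(runif(n, 80, 180))
)
train$HeartDisease <- factor(as.integer(train$ST_SlopeUp == 0 & train$MaxHR < 151))

tree <- rpart(HeartDisease ~ ., data = train, method = "class")
pred <- predict(tree, train, type = "class")
mean(pred == train$HeartDisease)  # training accuracy on the toy data
```

A tree diagram like the figure below could then be drawn with rpart.plot::rpart.plot(tree).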
Decision Tree
Random Forest
After gaining an understanding of how a single decision tree functions, we proceed with the bulk of our analysis using the random forest algorithm.
We trained the random forest using 100 trees and graphed the decision tree and evaluation metrics below. The random forest achieved an accuracy of 88.4%, which is higher than the 81.7% obtained from the single decision tree, as expected.
The confusion matrix shows that there were 104 true negatives (HeartDisease = 0), 94 true positives (HeartDisease = 1), 13 false negatives, and 13 false positives. Because the false negative and false positive counts are equal, precision and recall (sensitivity) are identical, and therefore so is the F1 score.
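Training and evaluating the forest can be sketched with the randomForest package (an assumption; the report does not name its implementation). The synthetic data generator below is a noisy toy stand-in for the 70/30 split:

```r
library(randomForest)

# Synthetic stand-in for the 70/30 split: a noisy version of the rule
# from the decision-tree section
set.seed(42)
make_data <- function(n) {
  d <- data.frame(
    ST_SlopeUp = rbinom(n, 1, 0.5),
    MaxHR      = round(runif(n, 80, 180)),
    Oldpeak    = round(runif(n, 0, 4), 1)
  )
  p <- ifelse(d$ST_SlopeUp == 0 & d$MaxHR < 151, 0.9, 0.1)
  d$HeartDisease <- factor(rbinom(n, 1, p))
  d
}
train <- make_data(500)
test  <- make_data(200)

rf <- randomForest(HeartDisease ~ ., data = train, ntree = 100)  # 100 trees, as in the report
pred <- predict(rf, test)
table(Predicted = pred, Actual = test$HeartDisease)  # confusion matrix
mean(pred == test$HeartDisease)                      # accuracy
```
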
Confusion Matrix - Default
Hyperparameter Tuning
After achieving an accuracy of 88.4% on the initial random forest model built using default parameters, we used hyperparameter tuning to improve the model further.
Tuning the model is a crucial step in machine learning because the default values will not be the most accurate or generalizable (Probst, Wright, and Boulesteix 2019).
We enhanced the model by using 5-fold cross-validation to tune the key parameter mtry, the number of variables that are randomly chosen at every split of a tree.
For the cross-validation, the training data was divided into five folds; the model was trained on four folds and validated on the remaining one.
This cycle was repeated for every fold, performance was aggregated across folds, and the Area Under the ROC Curve (AUC) served as the main optimization metric.
The AUC was chosen because it assesses a model's performance independently of any particular classification threshold, which matters greatly in binary classification problems with an imbalanced dataset.
We evaluated mtry values from 2 through 5 and found that mtry = 3 achieved the highest mean AUC (0.9302), suggesting that this setting provided the best compromise between overfitting and underfitting (Oshiro, Perez, and Baranauskas 2012).
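This tuning procedure can be sketched with caret (an assumption; the report does not name its tuning framework). caret's twoClassSummary requires a factor outcome with valid R level names and class probabilities; the synthetic data is a toy stand-in for the training split:

```r
library(caret)

# Synthetic stand-in for the training split
set.seed(42)
n <- 400
train <- data.frame(
  ST_SlopeUp = rbinom(n, 1, 0.5),
  MaxHR      = round(runif(n, 80, 180)),
  Oldpeak    = round(runif(n, 0, 4), 1),
  Age        = round(runif(n, 30, 75)),
  RestingBP  = round(runif(n, 90, 180))
)
p <- ifelse(train$ST_SlopeUp == 0 & train$MaxHR < 151, 0.9, 0.1)
train$HeartDisease <- factor(rbinom(n, 1, p), levels = c(0, 1),
                             labels = c("No", "Yes"))  # valid R level names

ctrl <- trainControl(method = "cv", number = 5,          # 5-fold CV
                     classProbs = TRUE,                  # needed for AUC
                     summaryFunction = twoClassSummary)  # reports ROC/Sens/Spec

set.seed(42)
rf_tuned <- train(HeartDisease ~ ., data = train,
                  method = "rf",                      # randomForest backend
                  metric = "ROC",                     # optimize mean AUC
                  tuneGrid = expand.grid(mtry = 2:5), # candidate values
                  trControl = ctrl)
rf_tuned$bestTune  # the mtry with the highest mean AUC across folds
```
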
Confusion Matrix - Tuned Model
Area Under Curve
(Figure 6: ROC curve with an AUC score of both basic RF Model & tuned RF Model)
Analysis Summary
These numbers demonstrate that the tuned model not only does well on the training folds but can also be expected to perform well on unseen data.
The ROC (Receiver Operating Characteristic) curve plots sensitivity (true positive rate) against the false positive rate (1 - specificity) at different classification thresholds.
The ROC curve indicates strong separation between classes, emphasizing a steep rise towards the top left corner which implies high sensitivity and low false positive rate.
The basic random forest model had an AUC of 0.8837, while the tuned random forest model achieved an AUC of 0.9371.
This shows how hyperparameter tuning can improve the accuracy and reliability of heart disease diagnosis.
The AUC value, derived from the ROC curve, indicates strong discriminatory ability.
An AUC above 0.90 is typically considered excellent (Fawcett 2006).
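Computing and plotting a ROC curve with its AUC can be sketched with the pROC package (an assumption; the report does not name the package). The toy labels and scores below stand in for the test labels and the model's predicted probabilities:

```r
library(pROC)

# Toy stand-in: true labels plus model scores that partially separate
# the two classes
set.seed(42)
labels <- rbinom(300, 1, 0.5)
scores <- labels + rnorm(300, sd = 0.8)  # positives tend to score higher

roc_obj <- roc(labels, scores, quiet = TRUE)
auc(roc_obj)   # threshold-independent measure of class separation
plot(roc_obj)  # a curve hugging the top-left corner indicates a strong model
```
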
Feature Importance
We also examined feature importance, which showed ST_SlopeUp, ChestPainTypeASY, and ST_SlopeFlat to be among the most important predictors.
This is consistent with medical domain knowledge since changes in ST segments and chest pain types are known markers of cardiac abnormality (Khalilia, Chakraborty, and Popescu 2011).
The model’s ability to capture meaningful physiological patterns is also supported by the high ranking of MaxHR and Oldpeak.
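Gini-based feature importance can be read off a fitted forest with randomForest's importance() and varImpPlot() (assuming the randomForest package, as in the earlier sketch). In the toy data below, ST_SlopeUp and MaxHR truly drive the outcome while Noise does not, so they should rank higher:

```r
library(randomForest)

# Synthetic stand-in where ST_SlopeUp and MaxHR drive the outcome
set.seed(42)
n <- 500
d <- data.frame(
  ST_SlopeUp = rbinom(n, 1, 0.5),
  MaxHR      = round(runif(n, 80, 180)),
  Noise      = rnorm(n)                 # uninformative feature
)
p <- ifelse(d$ST_SlopeUp == 0 & d$MaxHR < 151, 0.9, 0.1)
d$HeartDisease <- factor(rbinom(n, 1, p))

rf <- randomForest(HeartDisease ~ ., data = d, ntree = 100)
importance(rf)  # MeanDecreaseGini per feature
varImpPlot(rf)  # the kind of plot behind a figure like Figure 7
```
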
Feature Importance
(Figure 7. Feature importance based on the average decrease in Gini index.)
Conclusion
The aim of this study was to evaluate the Random Forest algorithm.
Using a heart disease data set, we compared a model with default settings against a model optimized through hyperparameter tuning.
Both models provided satisfactory classification performance (>80% accuracy), and the optimized model showed a clear improvement over the initial model when comparing AUC values.
The tuned model had an accuracy of 88.4% and sensitivity of 94.9% with the corresponding F1 score of 0.892.
These results demonstrate the importance of tuning in improving the performance of the model.
This is especially true for clinical decision-support systems where reducing false negatives is critical.
From a practical perspective, this study has important implications for the use of Random Forests in clinical decision support.
The algorithm’s robust nature and ability to handle various data types makes it an excellent algorithm for healthcare applications.
The performance gain after tuning implies that a Random Forest with default hyperparameters should not be used in a real-world machine learning deployment, although it can serve as a baseline for initial evaluation or for determining which model type may be most appropriate.
The Random Forest algorithm is effective in medical prediction tasks, as well as in various other classification and regression problems.
Random Forest can provide healthcare professionals with substantial support in terms of early diagnostics and risk stratification if it is tuned correctly, which can lead to improved patient outcomes and optimal use of clinical resources.
Fawcett, Tom. 2006. “An Introduction to ROC Analysis.” Pattern Recognition Letters 27 (8): 861–74. https://doi.org/10.1016/j.patrec.2005.10.010.
Khalilia, Mohammed, Sounak Chakraborty, and Mihail Popescu. 2011. “Predicting Disease Risks from Highly Imbalanced Data Using Random Forest.” BMC Medical Informatics and Decision Making 11: 1–13. https://doi.org/10.1186/1472-6947-11-51.
Oshiro, Thais Mayumi, Pedro Santoro Perez, and José Augusto Baranauskas. 2012. “How Many Trees in a Random Forest?” In Machine Learning and Data Mining in Pattern Recognition: 8th International Conference, MLDM 2012, Berlin, Germany, July 13-20, 2012. Proceedings 8, 154–68. Springer. https://doi.org/10.1007/978-3-642-31537-4_13.
Probst, Philipp, Marvin N. Wright, and Anne-Laure Boulesteix. 2019. “Hyperparameters and Tuning Strategies for Random Forest.” WIREs Data Mining and Knowledge Discovery 9 (3): e1301. https://doi.org/10.1002/widm.1301.